Discriminative Clustering of Text Documents
Abstract
Vector-space and distributional methods for text document clustering are discussed. Discriminative clustering, a recently proposed method, uses external data to find task-relevant characteristics of the documents, yet the clustering remains defined even when no external data is available. We introduce a distributional version of discriminative clustering that represents text documents as probability distributions. The methods are tested on the task of clustering scientific document abstracts, and their ability to predict an independent topical classification of the abstracts is compared. The discriminative methods found topically more meaningful clusters than the vector-space and distributional clustering models.
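As a rough illustration of what representing text documents as probability distributions can look like, the sketch below maps each document to a smoothed distribution over a shared vocabulary and compares documents with the Jensen-Shannon divergence. Whitespace tokenization, add-one smoothing, the choice of divergence, the toy documents, and the function names are all assumptions made for this example; the abstract does not specify them, and this is not the discriminative clustering method itself.

```python
import math
from collections import Counter


def word_distribution(text, vocabulary, smoothing=1.0):
    """Map a document to a smoothed probability distribution over the vocabulary.

    Assumption: whitespace tokenization and additive smoothing, chosen only
    to keep the example small.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return [(counts[w] + smoothing) / total for w in vocabulary]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded variant of KL."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical toy documents, not taken from the paper's data set.
docs = [
    "clustering of text documents",
    "text documents clustering methods",
    "protein folding simulation",
]
vocabulary = sorted({w for d in docs for w in d.split()})
dists = [word_distribution(d, vocabulary) for d in docs]

# Documents that share vocabulary should end up closer to each other.
near = js_divergence(dists[0], dists[1])
far = js_divergence(dists[0], dists[2])
print(f"JS(doc0, doc1) = {near:.4f}")
print(f"JS(doc0, doc2) = {far:.4f}")
```

A distributional clustering method would group documents whose distributions are close under a measure of this kind; the discriminative variant described above would additionally use external data to decide which differences matter.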
Similar resources
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
TCUAP: A Novel Approach of Text Clustering Using Asymmetric Proximity
Text documents have sparse data spaces, and existing methods of text clustering use symmetric proximity to measure the correlation of documents. In this paper, we propose a novel approach to strengthen the discriminative feature of document objects, which uses asymmetric proximity for text clustering. We present a measure of asymmetric proximity between documents and between clusters. TCU...
Text Document Clustering Using DPM with Concept and Feature Analysis
Clustering is one of the most important techniques in machine learning and data mining tasks. Clustering techniques group similar documents. Similarity measures are used to determine transaction relationships. Hierarchical clustering models produce tree-structured results. Partition-based clustering produces the outcome in grid format. Text documents are unstructured data ...
Concept Chain Based Text Clustering
Unlike more familiar clustering objects, text documents have sparse data spaces. A common way of representing a document is as a bag of its component words, but this ignores the semantic relations between words. In this paper, we propose a novel document representation approach to strengthen the discriminative feature of document objects. We replace terms of documents with concepts in WordNet...
Locally discriminative topic modeling
Topic modeling is a powerful tool for discovering the underlying or hidden structure in text corpora. Typical algorithms for topic modeling include probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Despite their different inspirations, both approaches are instances of generative models, and the discriminative structure of the documents is ignored. In this p...